conformational ensemble
From Static Structures to Ensembles: Studying and Harnessing Protein Structure Tokenization
Liu, Zijing, Feng, Bin, Cao, He, Li, Yu
Protein structure tokenization converts 3D structures into discrete or vectorized representations, enabling the integration of structural and sequence data. Despite many recent works on structure tokenization, the properties of the underlying discrete representations are not well understood. In this work, we first demonstrate that the successful utilization of structural tokens in a language model for structure prediction depends on using rich, pre-trained sequence embeddings to bridge the semantic gap between the sequence and structural "language". The analysis of the structural vocabulary itself then reveals significant semantic redundancy, where multiple distinct tokens correspond to nearly identical local geometries, acting as "structural synonyms". This redundancy, rather than being a flaw, can be exploited with a simple "synonym swap" strategy to generate diverse conformational ensembles by perturbing a predicted structure with its structural synonyms. This computationally lightweight method accurately recapitulates protein flexibility, performing competitively with state-of-the-art models. Our study provides fundamental insights into the nature of discrete protein structure representations and introduces a powerful, near-instantaneous method for modeling protein dynamics. Source code is available in https://github.com/IDEA-XL/TokenMD.
- Asia > China > Guangdong Province > Shenzhen (0.41)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > China > Hong Kong (0.04)
Generative Modeling of Full-Atom Protein Conformations using Latent Diffusion on Graph Embeddings
Sengar, Aditya, Hariri, Ali, Probst, Daniel, Barth, Patrick, Vandergheynst, Pierre
Generating diverse, all-atom conformational ensembles of dynamic proteins such as G-protein-coupled receptors (GPCRs) is critical for understanding their function, yet most generative models simplify atomic detail or ignore conformational diversity altogether. We present latent diffusion for full protein generation (LD-FPG), a framework that constructs complete all-atom protein structures, including every side-chain heavy atom, directly from molecular dynamics (MD) trajectories. LD-FPG employs a Chebyshev graph neural network (ChebNet) to obtain low-dimensional latent embeddings of protein conformations, which are processed using three pooling strategies: blind, sequential and residue-based. A diffusion model trained on these latent representations generates new samples that a decoder, optionally regularized by dihedral-angle losses, maps back to Cartesian coordinates. Using D2R-MD, a 2-microsecond MD trajectory (12 000 frames) of the human dopamine D2 receptor in a membrane environment, the sequential and residue-based pooling strategy reproduces the reference ensemble with high structural fidelity (all-atom lDDT of approximately 0.7; C-alpha-lDDT of approximately 0.8) and recovers backbone and side-chain dihedral-angle distributions with a Jensen-Shannon divergence of less than 0.03 compared to the MD data. LD-FPG thereby offers a practical route to system-specific, all-atom ensemble generation for large proteins, providing a promising tool for structure-based therapeutic design on complex, dynamic targets. The D2R-MD dataset and our implementation are freely available to facilitate further research.
- North America > United States (0.28)
- Europe > Switzerland > Vaud > Lausanne (0.04)
- Europe > Netherlands (0.04)
- Asia > Middle East > Yemen > Amanat Al Asimah > Sanaa (0.04)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Information Technology (0.93)
- Energy (0.92)
- Government > Regional Government > North America Government (0.46)
Learning conformational ensembles of proteins based on backbone geometry
Wolf, Nicolas, Seute, Leif, Viliuga, Vsevolod, Wagner, Simon, Stühmer, Jan, Gräter, Frauke
Deep generative models have recently been proposed for sampling protein conformations from the Boltzmann distribution, as an alternative to often prohibitively expensive Molecular Dynamics simulations. However, current state-of-the-art approaches rely on fine-tuning pre-trained folding models and evolutionary sequence information, limiting their applicability and efficiency, and introducing potential biases. In this work, we propose a flow matching model for sampling protein conformations based solely on backbone geometry. We introduce a geometric encoding of the backbone equilibrium structure as input and propose to condition not only the flow but also the prior distribution on the respective equilibrium structure, eliminating the need for evolutionary information. The resulting model is orders of magnitudes faster than current state-of-the-art approaches at comparable accuracy and can be trained from scratch in a few GPU days. In our experiments, we demonstrate that the proposed model achieves competitive performance with reduced inference time, across not only an established benchmark of naturally occurring proteins but also de novo proteins, for which evolutionary information is scarce.
- Europe > Germany > Baden-Württemberg > Karlsruhe Region (0.14)
- Europe > Sweden (0.14)
- Research Report > Promising Solution (0.54)
- Research Report > New Finding (0.34)
JAMUN: Transferable Molecular Conformational Ensemble Generation with Walk-Jump Sampling
Daigavane, Ameya, Vani, Bodhi P., Saremi, Saeed, Kleinhenz, Joseph, Rackers, Joshua
They are not well characterized as single structures as has traditionally been the case, but rather as ensembles of structures with an ergodic probability distribution(Henzler-Wildman & Kern, 2007). Protein motion is required for myglobin to bind oxygen and move it around the body (Miller & Phillips, 2021). Drug discovery on protein kinases depends on characterizing kinase conforma-tional ensembles (Gough & Kalodimos, 2024). The search for druggable'cryptic pockets' requires understanding protein dynamics, and antibody design is deeply affected by conformational ensembles (Colombo, 2023). However, while machine learning (ML) methods for molecular structure prediction have experienced enormous success recently, ML methods for dynamics have yet to have similar impact. ML models for generating molecular ensembles are widely considered the'next frontier' (Bowman, 2024; Miller & Phillips, 2021; Zheng et al., 2023).
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > California > San Mateo County > South San Francisco (0.04)
Score-based Enhanced Sampling for Protein Molecular Dynamics
Lu, Jiarui, Zhong, Bozitao, Tang, Jian
The dynamic nature of proteins is crucial for determining their biological functions and properties, and molecular dynamics (MD) simulations stand as a predominant tool to study such phenomena. By utilizing empirically derived force fields, MD simulations explore the conformational space through numerically evolving the system along MD trajectories. However, the high-energy barrier of the force fields can hamper the exploration of MD, resulting in inadequately sampled ensemble. In this paper, we propose leveraging score-based generative models (SGMs) trained on general protein structures to perform protein conformational sampling to complement traditional MD simulations. We argue that SGMs can provide a novel framework as an alternative to traditional enhanced sampling methods by learning multi-level score functions, which directly sample a diversity-controllable ensemble of conformations. We demonstrate the effectiveness of our approach on several benchmark systems by comparing the results with long MD trajectories and state-of-the-art generative structure prediction models. Our framework provides new insights that SGMs have the potential to serve as an efficient and simulation-free methods to study protein dynamics.
- North America > Canada > Quebec (0.14)
- North America > United States (0.14)
- Europe > United Kingdom > England (0.14)
- Asia (0.14)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Energy > Oil & Gas > Upstream (0.46)
Daily Digest
Diseases that have a complex genetic architecture tend to suffer from considerable amounts of genetic variants that, although playing a role in the disease, have not yet been revealed as such. Here researchers present DiseaseCapsule, as a capsule-network-based approach that explicitly addresses to capture the hierarchical structure of the underlying genome data, and has the potential to fully capture the non-linear relationships between variants and disease. Researchers propose BIGKnock (BIobank-scale Gene-based association test via Knockoffs), a computationally efficient gene-based testing approach for biobank-scale data, that leverages long-range chromatin interaction data, and performs conditional genome-wide testing via knockoffs. Dynamics and conformational sampling are essential for linking protein structure to biological function. While challenging to probe experimentally, computer simulations are widely used to describe protein dynamics, but at significant computational costs that continue to limit the systems that can be studied.
Direct generation of protein conformational ensembles via machine learning
Dynamics and conformational sampling are essential for linking protein structure to biological function. While challenging to probe experimentally, computer simulations are widely used to describe protein dynamics, but at significant computational costs that continue to limit the systems that can be studied. Here, we demonstrate that machine learning can be trained with simulation data to directly generate physically realistic conformational ensembles of proteins without the need for any sampling and at negligible computational cost. As a proof-of-principle we train a generative adversarial network based on a transformer architecture with self-attention on coarse-grained simulations of intrinsically disordered peptides. The resulting model, idpGAN, can predict sequence-dependent coarse-grained ensembles for sequences that are not present in the training set demonstrating that transferability can be achieved beyond the limited training data. We also retrain idpGAN on atomistic simulation data to show that the approach can be extended in principle to higher-resolution conformational ensemble generation. Computational methods to study protein structural dynamics are a powerful tool in life sciences but are computationally expensive. Here, the authors show that machine learning can be used to efficiently generate protein conformational ensembles and test their method on intrinsically disordered peptides.
Artificial intelligence guided conformational mining of intrinsically disordered proteins - Communications Biology
Artificial intelligence recently achieved the breakthrough of predicting the three-dimensional structures of proteins. The next frontier is presented by intrinsically disordered proteins (IDPs), which, representing 30% to 50% of proteomes, readily access vast conformational space. Molecular dynamics (MD) simulations are promising in sampling IDP conformations, but only at extremely high computational cost. Here, we developed generative autoencoders that learn from short MD simulations and generate full conformational ensembles. An encoder represents IDP conformations as vectors in a reduced-dimensional latent space. The mean vector and covariance matrix of the training dataset are calculated to define a multivariate Gaussian distribution, from which vectors are sampled and fed to a decoder to generate new conformations. The ensembles of generated conformations cover those sampled by long MD simulations and are validated by small-angle X-ray scattering profile and NMR chemical shifts. This work illustrates the vast potential of artificial intelligence in conformational mining of IDPs. Generative autoencoders create full conformational ensembles of intrinsically disordered proteins from short molecular dynamics simulations.